Voice Conversion System and Methodology

RELATED APPLICATIONS 

This application claims the benefit of U.S. Provisional Application No. 
60/036,227, entitled "Voice Conversion by Segmental Codebook Mapping of Line 
Spectral Frequencies and Excitation System," filed on January 27, 1997 by Levent M. 
Arslan and David Talkin, incorporated herein by reference. 



FIELD OF THE INVENTION 

The present invention relates to voice conversion and, more particularly, to
codebook-based voice conversion systems and methodologies.

BACKGROUND OF THE INVENTION 

A voice conversion system receives speech from one speaker and transforms the
speech to sound like the speech of another speaker. Voice conversion is useful in a
variety of applications. For example, a voice recognition system may be trained to
recognize a specific person's voice or a normalized composite of voices. Voice
conversion as a front-end to the voice recognition system allows a new person to
effectively utilize the system by converting the new person's voice into the voice that the
voice recognition system is adapted to recognize. As a post-processing step, voice
conversion changes the voice of a text-to-speech synthesizer. Voice conversion also has
applications in voice disguising, dialect modification, foreign-language dubbing to retain
the voice of an original actor, and novelty systems such as celebrity voice impersonation,
for example, in Karaoke machines.

In order to convert speech from a "source" voice to a "target" voice, codebooks of
the source voice and target voice are typically prepared in a training phase. A codebook
is a collection of "phones," which are units of speech sounds that a person utters. For
example, the spoken English word "cat" in the General American dialect comprises three
phones [K], [AE], and [T], and the word "cot" comprises three phones [K], [AA], and
[T]. In this example, "cat" and "cot" share the initial and final consonants but employ
different vowels. Codebooks are structured to provide a one-to-one mapping between the
phone entries in a source codebook and the phone entries in the target codebook.

U.S. Patent No. 5,327,521 describes a conventional voice conversion system
using a codebook approach. An input signal from a source speaker is sampled and
preprocessed by segmentation into "frames" corresponding to a speech unit. Each frame
is matched to the "closest" source codebook entry and then mapped to the corresponding
target codebook entry to obtain a phone in the voice of the target speaker. The mapped
frames are concatenated to produce speech in the target voice. A disadvantage with this
and similar conventional voice conversion systems is the introduction of artifacts at frame
boundaries leading to a rather rough transition across target frames. Furthermore, the
variation between the sound of the input speech frame and the closest matching source
codebook entry is discarded, leading to a low quality voice conversion.

A common cause for the variation between the sounds in speech and in the codebook
is that sounds differ depending on their position in a word. For example, the /t/ phoneme
has several "allophones." At the beginning of a word, as in the General American
pronunciation of the word "top," the /t/ phoneme is an unvoiced, fortis, aspirated,
alveolar stop. In an initial cluster with an /s/, as in the word "stop," it is an unvoiced,
fortis, unaspirated, alveolar stop. In the middle of a word between vowels, as in "potter,"
it is an alveolar flap. At the end of a word, as in "pot," it is an unvoiced, lenis,
unaspirated, alveolar stop. Although the allophones of a consonant like /t/ are
pronounced differently, a codebook with only one entry for the /t/ phoneme will produce
only one kind of /t/ sound and, hence, unconvincing output. Prosody also accounts for
differences in sound, since a consonant or vowel will sound somewhat different when
spoken at a higher or lower pitch, more or less rapidly, and with greater or lesser
emphasis.

Accordingly, one conventional attempt to improve voice conversion quality is to
greatly increase the amount of training data and the number of codebook entries to
account for the different allophones of the same phoneme and different prosodic
conditions. Greater codebook sizes lead to increased storage and computational costs.
Conventional voice conversion systems also suffer a loss of quality because they
typically perform their codebook mapping in an acoustic space defined by linear
predictive coding coefficients. Linear predictive coding is an all-pole modeling of speech
and, hence, does not adequately represent the zeroes in a speech signal, which are more
commonly found in nasal sounds and in sounds not originating at the glottis. Linear
predictive coding also has difficulties with higher pitched sounds, for example, women's
voices and children's voices.

SUMMARY OF THE INVENTION 

There exists a need for a voice conversion system and methodology having
improved quality output, but preferably still computationally tractable. Differences in
sound due to word position and prosody need to be addressed without increasing the size
of codebooks. Furthermore, there is a need to account for voice features that are not well
supported by linear predictive coding, such as the glottal excitation, nasalized sounds,
and sounds not originating at the glottis.

Accordingly, one aspect of the invention is a method and a computer-readable
medium bearing instructions for transforming a source signal representing a source voice
into a target signal representing a target voice. The source signal is preprocessed to
produce a source signal segment, which is compared with source codebook entries to
produce corresponding weights. The source signal segment is transformed into a target
signal segment based on the weights and corresponding target codebook entries and post
processed to generate the target signal. By computing a weighted average, a composite
source voice can be mapped to a corresponding composite target voice, thereby reducing
artifacts at frame boundaries and leading to smoother transitions between frame
boundaries without having to employ a large number of codebook entries.

In another aspect of the invention, the source signal segment is compared with the
source codebook entries as line spectral frequencies to facilitate the computation of the
weighted average. In still another aspect of the invention, the weights are refined by a
gradient descent analysis to further improve voice quality. In a further aspect of the
invention, both vocal tract characteristics and excitation characteristics are transformed
according to the weights, thereby handling excitation characteristics in a computationally
tractable manner.

Additional needs, objects, advantages, and novel features of the present invention
will be set forth in part in the description that follows, and in part, will become apparent
upon examination or may be learned by practice of the invention. The objects and
advantages of the invention may be realized and obtained by means of the
instrumentalities and combinations particularly pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention is illustrated by way of example, and not by way of
limitation, in the figures of the accompanying drawings and in which like reference
numerals refer to similar elements and in which:

Fig. 1 schematically depicts a computer system that can implement the present
invention;

Fig. 2 depicts codebook entries for a source speaker and a target speaker;

Fig. 3 is a flowchart illustrating the operation of voice conversion according to an
embodiment of the present invention;

Fig. 4 is a flowchart illustrating the operation of refining codebook weights by a
gradient descent analysis according to an embodiment of the present invention; and

Fig. 5 depicts a bandwidth reduction of formants of a weighted target voice
spectrum according to an embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENT

A method and apparatus for voice conversion is described. In the following
description, for the purposes of explanation, numerous specific details are set forth in
order to provide a thorough understanding of the present invention. It will be apparent,
however, to one skilled in the art that the present invention may be practiced without
these specific details. In other instances, well-known structures and devices are shown in
block diagram form in order to avoid unnecessarily obscuring the present invention.

Hardware Overview
Figure 1 is a block diagram that illustrates a computer system 100 upon which an
embodiment of the invention may be implemented. Computer system 100 includes a bus
102 or other communication mechanism for communicating information, and a processor
(or a plurality of central processing units working in cooperation) 104 coupled with bus
102 for processing information. Computer system 100 also includes a main memory 106,
such as a random access memory (RAM) or other dynamic storage device, coupled to bus
102 for storing information and instructions to be executed by processor 104. Main
memory 106 also may be used for storing temporary variables or other intermediate
information during execution of instructions to be executed by processor 104. Computer
system 100 further includes a read only memory (ROM) 108 or other static storage
device coupled to bus 102 for storing static information and instructions for processor
104. A storage device 110, such as a magnetic disk or optical disk, is provided and
coupled to bus 102 for storing information and instructions.
Computer system 100 may be coupled via bus 102 to a display 111, such as a
cathode ray tube (CRT), for displaying information to a computer user. An input device
113, including alphanumeric and other keys, is coupled to bus 102 for communicating
information and command selections to processor 104. Another type of user input device
is cursor control 115, such as a mouse, a trackball, or cursor direction keys for
communicating direction information and command selections to processor 104 and for
controlling cursor movement on display 111. This input device typically has two degrees
of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the
device to specify positions in a plane. For audio output and input, computer system 100
may be coupled to a speaker 117 and a microphone 119, respectively.

The invention is related to the use of computer system 100 for voice conversion.
According to one embodiment of the invention, voice conversion is provided by
computer system 100 in response to processor 104 executing one or more sequences of
one or more instructions contained in main memory 106. Such instructions may be read
into main memory 106 from another computer-readable medium, such as storage device
110. Execution of the sequences of instructions contained in main memory 106 causes
processor 104 to perform the process steps described herein. One or more processors in a
multi-processing arrangement may also be employed to execute the sequences of
instructions contained in main memory 106. In alternative embodiments, hard-wired
circuitry may be used in place of or in combination with software instructions to
implement the invention. Thus, embodiments of the invention are not limited to any
specific combination of hardware circuitry and software.

The term "computer-readable medium" as used herein refers to any medium that
participates in providing instructions to processor 104 for execution. Such a medium
may take many forms, including but not limited to, non-volatile media, volatile media,
and transmission media. Non-volatile media include, for example, optical or magnetic
disks, such as storage device 110. Volatile media include dynamic memory, such as
main memory 106. Transmission media include coaxial cables, copper wire and fiber
optics, including the wires that comprise bus 102. Transmission media can also take the
form of acoustic or light waves, such as those generated during radio frequency (RF) and
infrared (IR) data communications. Common forms of computer-readable media include,
for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic
medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any
other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM,
any other memory chip or cartridge, a carrier wave as described hereinafter, or
any other medium from which a computer can read.

Various forms of computer readable media may be involved in carrying one or
more sequences of one or more instructions to processor 104 for execution. For example,
the instructions may initially be borne on a magnetic disk of a remote computer. The
remote computer can load the instructions into its dynamic memory and send the
instructions over a telephone line using a modem. A modem local to computer system
100 can receive the data on the telephone line and use an infrared transmitter to convert
the data to an infrared signal. An infrared detector coupled to bus 102 can receive the
data carried in the infrared signal and place the data on bus 102. Bus 102 carries the data
to main memory 106, from which processor 104 retrieves and executes the instructions.
The instructions received by main memory 106 may optionally be stored on storage
device 110 either before or after execution by processor 104.

Computer system 100 also includes a communication interface 120 coupled to bus
102. Communication interface 120 provides a two-way data communication coupling to
a network link 121 that is connected to a local network 122. Examples of communication
interface 120 include an integrated services digital network (ISDN) card, a modem to
provide a data communication connection to a corresponding type of telephone line, and
a local area network (LAN) card to provide a data communication connection to a
compatible LAN. Wireless links may also be implemented. In any such implementation,
communication interface 120 sends and receives electrical, electromagnetic or optical
signals that carry digital data streams representing various types of information.

Network link 121 typically provides data communication through one or more
networks to other data devices. For example, network link 121 may provide a connection
through local network 122 to a host computer 124 or to data equipment operated by an
Internet Service Provider (ISP) 126. ISP 126 in turn provides data communication
services through the world wide packet data communication network, now commonly
referred to as the "Internet" 128. Local network 122 and Internet 128 both use electrical,
electromagnetic or optical signals that carry digital data streams. The signals through the
various networks and the signals on network link 121 and through communication
interface 120, which carry the digital data to and from computer system 100, are
exemplary forms of carrier waves transporting the information.

Computer system 100 can send messages and receive data, including program
code, through the network(s), network link 121, and communication interface 120. In the
Internet example, a server 130 might transmit a requested code for an application
program through Internet 128, ISP 126, local network 122 and communication interface
118. In accordance with the invention, one such downloaded application provides for
voice conversion as described herein. The received code may be executed by processor
104 as it is received, and/or stored in storage device 110, or other non-volatile storage for
later execution. In this manner, computer system 100 may obtain application code in the
form of a carrier wave.

Source and Target Codebooks
In accordance with the present invention, codebooks for the source voice and the
target voice are prepared as a preliminary step, using processed samples of the source and
target speech, respectively. The number of entries in the codebooks may vary from
implementation to implementation and depends on a trade-off of conversion quality and






WO 98/35340 PCT/US98/01538



computational tractability. For example, better conversion quality may be obtained by
including a greater number of phones in various phonetic contexts but at the expense of
increased utilization of computing resources and a larger demand on training data.
Preferably, the codebooks include at least one entry for every phoneme in the conversion
language. However, the codebooks may be augmented to include allophones of
phonemes and common phoneme combinations. Figure 2 depicts an exemplary codebook
comprising 64 entries. Since vowel quality often depends on the length and stress of the
vowel, a plurality of vowel phones for a particular vowel, for example, [AA], [AA1], and
[AA2], are included in the exemplary codebook.

The entries in the source codebook and the target codebook are obtained by
recording the speech of the source speaker and the target speaker, respectively, and
segmenting their speech into phones. According to one training approach, the source and
target speakers are asked to utter words and sentences for which an orthographic
transcription is prepared. The training speech is sampled at an appropriate frequency such
as 16 kHz and automatically segmented using, for example, a forced alignment to a
phonetic translation of the orthographic transcription within an HMM framework using
Mel-cepstrum coefficients and delta coefficients as described in more detail in
C. Wightman & D. Talkin, The Aligner User's Manual, Entropic Research Laboratory, Inc.,
Washington, D.C., 1994.

Preferably, the source and target vocal tract characteristics in the codebook entries
are represented as line spectral frequencies (LSF). In contrast to conventional approaches
using linear prediction coefficients (LPC) or formant frequencies, line spectral
frequencies can be estimated quite reliably and have a fixed range useful for real-time
digital signal processing implementation. The line spectral frequency values for the
source and target codebooks can be obtained by first determining the linear predictive
coefficients $a_k$ for the sampled signal according to well-known techniques in the art. For
example, specialized hardware, software executing on a general purpose computer or
microprocessor, or a combination thereof, can ascertain the linear predictive coefficients
by such techniques as square-root or Cholesky decomposition, Levinson-Durbin
recursion, and lattice analysis introduced by Itakura and Saito. The linear predictive
coefficients $a_k$, which are recursively related to a sequence of partial correlation
(PARCOR) coefficients, form an inverse filter polynomial,
$A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$, which may be augmented with +1 and -1 to produce
the following polynomials, wherein the angles of the roots, $w_k$, are the line spectral
frequencies:

$$P(z) = (1 - z^{-1}) \prod_{k} \left(1 - 2\cos(w_k)\, z^{-1} + z^{-2}\right) \tag{1}$$

$$Q(z) = (1 + z^{-1}) \prod_{k} \left(1 - 2\cos(w_k)\, z^{-1} + z^{-2}\right) \tag{2}$$
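As a rough illustration of the conversion from linear predictive coefficients to line spectral frequencies, the following Python sketch builds the two polynomials from $A(z)$ and locates their roots on the unit circle by a sign-change search with bisection. The function name `lpc_to_lsf`, the grid size, and the root-finding strategy are assumptions made for this example and are not taken from the specification; the sketch follows the naming used here, in which $P(z)$ carries the $(1 - z^{-1})$ factor.

```python
import math

def lpc_to_lsf(a, grid=1024, iters=60):
    """Return line spectral frequencies (radians, ascending) for
    A(z) = 1 - sum_k a[k-1] z^-k. Hypothetical helper, not from the patent."""
    p = len(a)
    # Coefficients of A(z) for z^0 .. z^-(p+1); the last entry is a zero pad.
    A = [1.0] + [-x for x in a] + [0.0]
    rev = A[::-1]                                # z^-(p+1) * A(1/z)
    p_poly = [x - y for x, y in zip(A, rev)]     # antisymmetric, root at z = +1
    q_poly = [x + y for x, y in zip(A, rev)]     # symmetric, root at z = -1 for even p
    mid = (p + 1) / 2.0

    # On the unit circle each polynomial reduces to a real function of w.
    def p_real(w):
        return sum(c * math.sin((n - mid) * w) for n, c in enumerate(p_poly))

    def q_real(w):
        return sum(c * math.cos((n - mid) * w) for n, c in enumerate(q_poly))

    def roots(f):
        found = []
        ws = [math.pi * j / grid for j in range(1, grid)]
        for lo, hi in zip(ws, ws[1:]):
            if f(lo) * f(hi) < 0.0:              # a sign change brackets one root
                for _ in range(iters):           # refine by bisection
                    m = 0.5 * (lo + hi)
                    if f(lo) * f(m) <= 0.0:
                        hi = m
                    else:
                        lo = m
                found.append(0.5 * (lo + hi))
        return found

    return sorted(roots(p_real) + roots(q_real))
```

A fixed grid can miss two roots that fall inside the same grid cell, so a production implementation would likely use a finer search or a polynomial root solver.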

Preferably, a plurality of samples are taken for each source and target codebook
entry and averaged or otherwise processed, such as taking the median sample or the
sample closest to the mean, to produce a source centroid vector $S_i$ and target centroid
vector $T_i$, respectively, where $i \in 1..L$ and $L$ is the size of the codebook. Line spectral
frequencies can be converted back into linear predictive coefficients by generating a
sequence of coefficients via polynomials $P(z)$ and $Q(z)$ and, thence, the linear predictive
coefficients $a_k$.
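The reverse conversion can be sketched in the same spirit. The example below assumes an even analysis order and assumes that the sorted frequencies alternate between the two polynomials, with the lowest frequency belonging to the polynomial that carries the $(1 + z^{-1})$ factor; both assumptions, and the helper names `poly_mul` and `lsf_to_lpc`, are illustrative rather than taken from the specification. Averaging the two rebuilt polynomials recovers $A(z)$.

```python
import math

def poly_mul(x, y):
    """Multiply two polynomials given as coefficient lists in powers of z^-1."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out

def lsf_to_lpc(lsf):
    """Rebuild a_1..a_p of A(z) = 1 - sum_k a_k z^-k from p line spectral
    frequencies (radians). Assumes p is even."""
    p_poly = [1.0, -1.0]          # the (1 - z^-1) factor
    q_poly = [1.0, 1.0]           # the (1 + z^-1) factor
    for idx, w in enumerate(sorted(lsf)):
        factor = [1.0, -2.0 * math.cos(w), 1.0]
        if idx % 2 == 0:          # 1st, 3rd, ... frequency (assumed assignment)
            q_poly = poly_mul(q_poly, factor)
        else:                     # 2nd, 4th, ... frequency
            p_poly = poly_mul(p_poly, factor)
    # A(z) is the average of the two polynomials; the z^-(p+1) terms cancel.
    A = [0.5 * (x + y) for x, y in zip(p_poly, q_poly)]
    return [-c for c in A[1:-1]]
```

For instance, the pair of frequencies `acos(0.75)` and `acos(0.25)` corresponds to $A(z) = 1 - z^{-1} + 0.5 z^{-2}$, that is, $a_1 = 1.0$ and $a_2 = -0.5$.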

Thus, the source codebook and the target codebook have corresponding entries
containing speech samples derived respectively from the source speaker and the target
speaker. Referring again to Figure 2, the light curves in each codebook entry represent
the (male) source speaker's voice and the dark curves in each codebook entry represent
the (female) target speaker's voice.

Converting Speech 
When the appropriate codebooks for the source and target speakers have been
prepared, input speech in the source voice is transformed into the voice of the target
speaker, according to one embodiment of the present invention, by performing the steps
illustrated in Fig. 3. In step 300, the input speech is preprocessed to obtain an input
speech frame. More specifically, the input speech is sampled at an appropriate frequency
such as 16 kHz, and the DC bias is removed as by mean removal. The sampled signal is
also windowed to produce the input speech frame $x(n) = w(n)s(n)$, where $w(n)$ is a data
windowing function providing a raised cosine window, e.g. a Hamming window or a
Hanning window, or other window such as a rectangular window or a center-weighted
window.
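The preprocessing of step 300 can be sketched as follows. This is a minimal illustration assuming a Hamming window; the function name `preprocess_frame` is hypothetical, and the caller is assumed to have already cut the sampled signal into frames.

```python
import math

def preprocess_frame(samples):
    """Remove the DC bias by mean removal, then apply a Hamming window,
    giving x(n) = w(n) s(n) for one frame of sampled speech."""
    n = len(samples)
    if n < 2:
        raise ValueError("a frame needs at least two samples")
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    # Hamming window: a raised cosine that tapers the frame edges.
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    return [w * s for w, s in zip(window, centered)]
```

A constant input frame maps to all zeros, which is a quick way to check that the DC bias is really removed before windowing.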

In step 302, the input speech frame is converted into line spectral frequency
format. According to one embodiment of the present invention, a linear predictive
coding analysis is first performed to determine the prediction coefficients $a_k$ for the
input speech frame. The linear predictive coding analysis is of an appropriate order, for
example, from a 14th order to a 30th order analysis, such as an 18th order or 20th order
analysis. Based on the prediction coefficients $a_k$, a line spectral frequency vector $w_k$ is
derived, as by the use of polynomials $P(z)$ and $Q(z)$, explained in more detail herein
above.
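One common way to carry out the linear predictive coding analysis is the autocorrelation method with the Levinson-Durbin recursion, which is among the techniques named earlier. The sketch below is an illustrative implementation, not the one used by the inventors; the helper names are hypothetical, and the sign convention matches $A(z) = 1 - \sum_k a_k z^{-k}$.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one (windowed) frame."""
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for the prediction coefficients.
    Returns (a, err) where a[k-1] holds a_k and err is the final
    prediction error energy."""
    a = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - 1 - k] for k in range(m - 1))
        refl = acc / err                      # reflection (PARCOR) coefficient
        a = [a[k] - refl * a[m - 2 - k] for k in range(m - 1)] + [refl]
        err *= (1.0 - refl * refl)
    return a, err
```

Applied to the decaying sequence `x(n) = 0.9**n`, a second-order analysis returns approximately `a_1 = 0.9` and `a_2 = 0`, as expected for a first-order autoregressive signal.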

Codebook Weights
Conventional voice conversions by codebook methodologies suffer from loss of
information due to matching only to a single, "closest" source phone. Consequently,
artifacts may be introduced at speech frame boundaries, leading to rough transitions from
one frame to the next. Accordingly, one embodiment of the invention matches the
incoming speech frame to a weighted average of a plurality of codebook entries rather
than to a single codebook entry. The weighting of codebook entries preferably reflects
perceptual criteria. Use of a plurality of codebook entries smoothes the transition
between speech frames and captures the vocal nuances between related sounds in the
target speech output. Thus, in step 304, codebook weights $v_i$ are estimated by comparing
the input line spectral frequency vector with each centroid vector $S_i$ in the source
codebook to calculate a corresponding distance $d_i$:

$$d_i = \sum_{k=1}^{P} h_k \left| w_k - S_{ik} \right|, \quad i \in 1..L \tag{3}$$

where $L$ is the codebook size. The distance calculation includes a weight factor $h_k$, which
is based on a perceptual criterion wherein closely spaced line spectral frequency pairs,
which are likely to correspond to formant locations, are assigned higher weights:

$$h_k = \frac{e^{-0.05\,|K - k|}}{\min\left(|w_k - w_{k-1}|,\ |w_k - w_{k+1}|\right)}, \quad k \in 1..P \tag{4}$$

where $K$ is 3 for voiced sounds and 6 for unvoiced, since the average energy decreases
(for voiced sounds) and increases (for unvoiced sounds) with increasing frequency.
Based on the calculated distances $d_i$, the normalized codebook weights $v_i$ are obtained as
follows:

$$v_i = \frac{e^{-\gamma d_i}}{\sum_{l=1}^{L} e^{-\gamma d_l}}, \quad i \in 1..L \tag{5}$$

where the value of $\gamma$ for each frame is found by an incremental search in the range of 0.2
to 2.0 with the criterion of minimizing the perceptual weighted distance between the
approximated line spectral frequency vector $vS_k$ and the input line spectral frequency
vector $w_k$.
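The weight estimation of step 304 can be sketched as below. Several details are assumptions made only for this illustration: the perceptual factor keeps just the inverse-spacing term (any additional dependence on $K$ is omitted), the first and last frequencies use their single neighbour, the normalized weights are taken to be an exponential of the distance, and the search over $\gamma$ uses a step of 0.1. All function names are hypothetical.

```python
import math

def perceptual_factors(w):
    """h_k grows as w_k gets closer to its nearest neighbouring frequency."""
    p = len(w)
    h = []
    for k in range(p):
        gaps = []
        if k > 0:
            gaps.append(abs(w[k] - w[k - 1]))
        if k < p - 1:
            gaps.append(abs(w[k] - w[k + 1]))
        h.append(1.0 / min(gaps))
    return h

def distances(w, codebook, h):
    """Perceptually weighted distance d_i to every source centroid S_i."""
    return [sum(hk * abs(wk - sk) for hk, wk, sk in zip(h, w, entry))
            for entry in codebook]

def normalized_weights(d, gamma):
    """Weights that decay exponentially with distance and sum to one."""
    scores = [math.exp(-gamma * di) for di in d]
    total = sum(scores)
    return [s / total for s in scores]

def estimate_weights(w, codebook, step=0.1):
    """Search gamma in [0.2, 2.0] for the weights whose weighted average of
    source centroids best approximates the input vector w."""
    h = perceptual_factors(w)
    d = distances(w, codebook, h)
    best = None
    for j in range(int(round((2.0 - 0.2) / step)) + 1):
        gamma = 0.2 + j * step
        v = normalized_weights(d, gamma)
        approx = [sum(v[i] * codebook[i][k] for i in range(len(codebook)))
                  for k in range(len(w))]
        err = sum(hk * abs(wk - ak) for hk, wk, ak in zip(h, w, approx))
        if best is None or err < best[0]:
            best = (err, gamma, v)
    return best[2], best[1], h
```

When the input vector coincides with one centroid, that entry receives the largest weight and the search settles on the largest $\gamma$, since a sharper weighting then gives the closest approximation.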

Codebook Weight Refinement 
In some applications, even the normalized codebook weights $v_i$ may not be an
optimal set of weights that would represent the original speech spectrum. According to
one embodiment of the present invention, a gradient descent analysis is performed to
improve the estimated codebook weights $v_i$. Referring to the flowchart illustrated in Fig.
4, one implementation of a gradient descent analysis comprises an initialization step 400
wherein an error value $E$ is initialized to a very high number and a convergence constant
$\eta$ is initialized to a suitable value from 0.05 to 0.5 such as 0.1.
In the main loop of the gradient descent analysis, starting at step 402, an error
vector $e$ is calculated based on the distance between the approximated line spectral
frequency vector $vS$ and the input line spectral frequency vector $w$ and weighted by the
weight factor $h$. In step 404, the error value $E$ is saved in an old error variable oldE and
a new error value $E$ is calculated from the error vector $e$, for example, by a sum of the
absolute values or by a sum of squares. In step 406, the codebook weights $v_i$ are updated
by an addition of the error with respect to the source codebook vector $eS$, factored by the
convergence constant $\eta$ and constrained to be positive to prevent unrealistic estimates. In
order to reduce computation according to one embodiment of the present invention, the
convergence constant $\eta$ is adjusted based on the reduction in error. Specifically, if there
is a reduction in error, the convergence constant $\eta$ is increased, otherwise it is decreased
(step 408). The main loop is repeated until the reduction in error falls below an
appropriate threshold, such as one part in ten thousand (step 410).
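The loop of Fig. 4 might look roughly like the following. The growth and shrink factors applied to the convergence constant (1.1 and 0.5), the iteration cap, the use of the absolute relative change as the stopping test, and the choice to return the best weights seen so far are all assumptions added for this sketch; the specification does not state them. The function name `refine_weights` is hypothetical.

```python
def refine_weights(w, codebook, v, h, eta=0.1, tol=1e-4, max_iter=200):
    """Refine codebook weights v so that the weighted average of source
    centroids better approximates the input vector w. Returns (weights, error)."""
    n_entries, order = len(codebook), len(w)

    def error_vector(weights):
        approx = [sum(weights[i] * codebook[i][k] for i in range(n_entries))
                  for k in range(order)]
        return [h[k] * (w[k] - approx[k]) for k in range(order)]

    def total_error(weights):
        return sum(abs(e) for e in error_vector(weights))   # sum of absolute values

    v = list(v)
    err = total_error(v)
    best_v, best_err = list(v), err
    for _ in range(max_iter):
        e = error_vector(v)
        # Step 406: move each weight along e . S_i and clamp it at zero.
        v = [max(0.0, v[i] + eta * sum(e[k] * codebook[i][k] for k in range(order)))
             for i in range(n_entries)]
        old_err, err = err, total_error(v)                   # step 404
        if err < best_err:
            best_v, best_err = list(v), err
        eta = eta * 1.1 if err < old_err else eta * 0.5      # step 408 (assumed factors)
        if old_err == 0.0 or abs(old_err - err) / old_err < tol:
            break                                            # step 410
    return best_v, best_err
```

Because the best weights are tracked separately, the returned error is never worse than the error of the initial estimate, even if a step overshoots.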

It is observed that only a few codebook entries are assigned significantly large
weight values in the initial weight vector estimate $v$. Therefore, one embodiment of the
present invention, in order to save computation resources, updates the weights $v$ in step
406 only on the first few largest weights, e.g., on the five largest weights. Use of this
gradient descent method has resulted in an additional 15% reduction in the average
Itakura-Saito distance between the original spectra and the approximated spectra $vS_k$.
The average spectral distortion (SD), which is a common spectral quantizer performance
evaluation, was also reduced from 1.8 dB to 1.4 dB.



Vocal Tract Spectrum Mapping 
Referring back to Figure 3, in step 306, a target vocal tract filter $V_t(\omega)$ is
calculated as a weighted average of the entries in the target codebook to represent the
voice of the target speaker for the current speech frame. According to an embodiment of
the present invention, the refined codebook weights $v_i$ are applied to the target line
spectral frequency vectors $T_i$ to construct the target line spectral frequency vector $vT_k$:

$$vT_k = \sum_{i=1}^{L} v_i T_{ik}, \quad k \in 1..P \tag{7}$$
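The weighted average over target codebook entries is a short computation. The sketch below uses a hypothetical function name and plain lists; each row of `target_codebook` is one target centroid vector $T_i$.

```python
def weighted_target_lsf(v, target_codebook):
    """Target line spectral frequency vector: the sum over codebook entries
    of weight v_i times target centroid T_i, taken component by component."""
    order = len(target_codebook[0])
    return [sum(v[i] * target_codebook[i][k] for i in range(len(v)))
            for k in range(order)]
```

With weights `[0.25, 0.75]` and target centroids `[0.4, 1.0]` and `[0.8, 2.0]`, the result is `[0.7, 1.75]`.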

The target line spectral frequencies are then converted into target linear prediction
coefficients $a_k^t$, for example by way of polynomials $P(z)$ and $Q(z)$. The target linear
prediction coefficients $a_k^t$ are in turn used to estimate the target vocal tract filter $V_t(\omega)$:

$$V_t(\omega) = \left| \frac{1}{1 - \sum_{k=1}^{P} a_k^t e^{-jk\omega}} \right|^{\beta} \tag{8}$$

where $\beta$ should theoretically be 0.5. The averaging of line spectral frequencies, however,
often results in formants, or spectral peaks, with larger bandwidths, which is heard as a
buzz artifact. One approach in addressing this problem is to increase the value of $\beta$,
which adjusts the dynamic range of the spectrum and, hence, reduces the bandwidths of
the formant frequencies. One disadvantage with increasing $\beta$, however, is that the
bandwidth is reduced also in other frequency bands besides the formant locations,
thereby warping the target voice spectrum.
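The effect of the exponent on the dynamic range can be illustrated numerically. The sketch below assumes the vocal tract filter is the magnitude of the all-pole response $1 / (1 - \sum_k a_k e^{-jk\omega})$ raised to the power $\beta$; that exact form and the function name `vocal_tract_response` are assumptions made for this example.

```python
import cmath

def vocal_tract_response(a, omega, beta=0.5):
    """Magnitude of the all-pole response at angular frequency omega
    (radians per sample), raised to the power beta."""
    denom = 1.0 - sum(ak * cmath.exp(-1j * (k + 1) * omega)
                      for k, ak in enumerate(a))
    return abs(1.0 / denom) ** beta
```

Comparing the response at a spectral peak with the response at a valley for two values of `beta` shows the peak-to-valley ratio growing with `beta`, which matches the observation that a larger exponent sharpens every part of the spectrum and not only the formants.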

Accordingly, another approach is to reduce the bandwidths of the formants by
adjusting the line spectral frequencies directly. The target line spectrum pairs $w_i^j$ and
$w_{i+1}^j$ around the first $F$ formant frequency locations $f_j$, $j \in 1..F$, are modified,
wherein $F$ is set to a small integer such as four (4). The source formant bandwidths and
the target formant bandwidths $b_j^t$ are used to estimate a bandwidth adjustment ratio, $r$:

$$r = \ldots \tag{9}$$

Accordingly, each pair of target line spectrum $w_i^j$ and $w_{i+1}^j$ around the corresponding
formant frequency location $f_j$ is adjusted as follows:

$$w_i^j \leftarrow w_i^j + (1 - r)(f_j - w_i^j), \quad j \in 1..F \tag{10}$$

and

$$w_{i+1}^j \leftarrow w_{i+1}^j + (1 - r)(f_j - w_{i+1}^j), \quad j \in 1..F \tag{11}$$
A minimum bandwidth value, e.g., 50 Hz, may be set in order to prevent
the estimation of unreasonable bandwidths. Fig. 5 illustrates a comparison of the target
speech power spectrum for the [AA] vowel before (light curve 500) and after (dark curve
510) the application of this bandwidth reduction technique. Reduction in the bandwidth
of the first four formants 520, 530, 540, and 550 results in higher and more distinct
spectral peaks. According to detailed observations and subjective listening tests, use of
this bandwidth reduction technique has resulted in improved voice output quality.
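The adjustment of one line spectrum pair can be sketched as follows, reading equations (10) and (11) as moving each frequency of the pair toward the formant location by the fraction $(1 - r)$ of its distance. The function name `narrow_formant` and its arguments are hypothetical, and the minimum bandwidth check mentioned above is left out of this sketch.

```python
def narrow_formant(lsf, pair_index, formant, r):
    """Pull the pair lsf[pair_index], lsf[pair_index + 1] toward the formant
    location; r < 1 narrows the spacing, r = 1 leaves it unchanged."""
    out = list(lsf)
    for i in (pair_index, pair_index + 1):
        out[i] = out[i] + (1.0 - r) * (formant - out[i])
    return out
```

With a pair at 0.9 and 1.1 around a formant at 1.0 and a ratio of 0.5, the pair moves to 0.95 and 1.05, so its spacing is halved while the other frequencies stay in place.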

Excitation Characteristics Mapping 
Another factor that influences speaker individuality and, hence, voice conversion
quality is excitation characteristics. The excitation can be very different for different
phonemes. For example, voiced sounds are excited by a periodic pulse train or "buzz,"
and unvoiced sounds are excited by white noise or "hiss." According to one embodiment
of the present invention, the linear predictive coding residual is used as an approximation
of the excitation signal. In particular, the linear predictive coding residuals for each entry
in the source codebook and the target codebook are collected as the excitation signals
from the training data to compute a corresponding short-time average discrete Fourier
analysis or pitch-synchronous magnitude spectrum of the excitation signals. The
excitation spectra are used to formulate excitation transformation spectra for entries of
the source codebook, $U_i^s(\omega)$, and the target codebook, $U_i^t(\omega)$. Since linear predictive
coding is an all-pole model, the formulated excitation transformation filters serve to
transform the zeros in the spectrum as well, thereby further improving the quality of the
voice conversion.

Referring back to Figure 3, in step 308, the excitations in the input speech
segment are transformed from the source voice to the target voice by the same codebook
weights $v_i$ used in transforming the vocal tract characteristics. Specifically, an overall
excitation filter is constructed as a weighted combination of the codebook
excitation spectra:

$$H_g(\omega) = \sum_{i=1}^{L} v_i \frac{U_i^t(\omega)}{U_i^s(\omega)} \tag{12}$$
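The overall excitation filter can be computed bin by bin once the codebook excitation spectra are available. The sketch below reads the weighted combination as a sum over entries of the weight times the ratio of target to source excitation spectrum; that reading, the small floor that guards against division by zero, and the function name are assumptions for this illustration.

```python
def overall_excitation_filter(v, source_spectra, target_spectra, floor=1e-12):
    """Per-bin weighted sum of target/source excitation magnitude ratios.
    source_spectra[i] and target_spectra[i] hold the spectra of entry i."""
    bins = len(source_spectra[0])
    return [sum(v[i] * target_spectra[i][b] / max(source_spectra[i][b], floor)
                for i in range(len(v)))
            for b in range(bins)]
```

For two entries with equal weights, source spectra `[1, 2]` and `[2, 4]`, and target spectra `[2, 2]` and `[2, 8]`, the filter is `[1.5, 1.5]`.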

According to one embodiment of the present invention, the overall excitation
filter $H_g(\omega)$ is applied to the linear predictive coding residual $e(n)$ of the input speech
signal $x(n)$ to produce a target excitation filter:

$$G_t(\omega) = H_g(\omega)\, \mathrm{DFT}\{e(n)\} \tag{13}$$

where the linear predictive coding residual $e(n)$ is given by:

$$e(n) = x(n) - \sum_{k=1}^{P} a_k x(n-k) \tag{14}$$
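The residual of equation (14) is the part of each sample that the linear predictor fails to explain. A minimal sketch follows, assuming that samples before the start of the frame are treated as zero; the function name is hypothetical.

```python
def lpc_residual(x, a):
    """e(n) = x(n) - sum_k a_k x(n-k), with x(n) taken as 0 for n < 0."""
    residual = []
    for n in range(len(x)):
        predicted = sum(a[k] * x[n - 1 - k]
                        for k in range(len(a)) if n - 1 - k >= 0)
        residual.append(x[n] - predicted)
    return residual
```

For the sequence `x(n) = 0.9**n` and the single coefficient `a_1 = 0.9`, the residual is a unit impulse: the first sample is 1 and all later samples are, up to rounding, zero.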

Both the vocal tract characteristics and the excitation characteristics are
transformed in the same computational framework, by computing a weighted average of
codebook entries. Accordingly, this aspect of the present invention enables the
incorporation of excitation characteristics within a voice conversion system in a
computationally tractable manner.



Target Speech Filter

Referring again to Fig. 3, in step 310, a target speech filter $Y(\omega)$ is constructed on the
basis of the vocal tract filter $V_t(\omega)$ and, in some embodiments of the present invention, the
excitation filter $G_t(\omega)$. According to one embodiment, the target speech filter $Y(\omega)$ is
defined as the excitation filter $G_t(\omega)$ followed by the vocal tract filter $V_t(\omega)$:

$$Y(\omega) = G_t(\omega)\,V_t(\omega). \tag{15}$$

In accordance with another embodiment of the present invention, further
refinement to the construction of the target speech filter $Y(\omega)$ may be desirable for
improved handling of unvoiced sounds. The incoming speech spectrum $X(\omega)$, derived
from the sampled and windowed input speech $x(n)$, can be represented as

$$X(\omega) = G_s(\omega)\,V_s(\omega), \tag{16}$$

where $G_s(\omega)$ and $V_s(\omega)$ represent the source speaker excitation and vocal tract spectrum
filters, respectively. Consequently, the target speech spectrum filter $Y(\omega)$ can be
formulated as:



$$Y(\omega) = \frac{G_t(\omega)}{G_s(\omega)}\,\frac{V_t(\omega)}{V_s(\omega)}\,X(\omega) \tag{17}$$



Using the overall excitation filter $H_g(\omega)$ as an estimate of the excitation filter, the
target speech spectrum filter $Y(\omega)$ becomes:



$$Y(\omega) = H_g(\omega)\,\frac{V_t(\omega)}{V_s(\omega)}\,X(\omega) \tag{18}$$
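
The formulation of equation (18) may be illustrated, bin by bin, by the sketch below. It assumes that equation (18) multiplies the incoming spectrum by the overall excitation filter and by the ratio of target to source vocal tract spectra. The function name, the `floor` guard, and the sample spectra are hypothetical.

```python
def target_speech_spectrum(x_spec, h_g, v_t, v_s, floor=1e-12):
    """Return Y[k] = H_g[k] * (V_t[k] / V_s[k]) * X[k] for each frequency bin k.

    floor is a hypothetical guard replacing a near-zero source vocal tract
    value so that the division stays finite.
    """
    y_spec = []
    for x_k, h_k, vt_k, vs_k in zip(x_spec, h_g, v_t, v_s):
        denom = vs_k if abs(vs_k) > floor else floor
        y_spec.append(h_k * (vt_k / denom) * x_k)
    return y_spec


# Four-bin example with made-up values.
x_spec = [4 + 0j, 2 - 2j, 0j, 2 + 2j]
h_g = [1.0, 0.5, 0.5, 0.5]
v_t = [2.0, 1.0, 1.0, 1.0]
v_s = [1.0, 2.0, 4.0, 2.0]
y_spec = target_speech_spectrum(x_spec, h_g, v_t, v_s)
print(y_spec)
```

Dividing by the source vocal tract spectrum removes the source speaker's spectral envelope from the input before the target envelope is imposed.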



When the amount of the training data is small or when the accuracy of the
segmentation is in question, unvoiced segments are difficult to represent accurately, thereby
leading to a mismatch in the source and target vocal tract filters. Accordingly, one
embodiment of the present invention estimates a source speaker vocal tract spectrum
filter $V_s(\omega)$ differently for voiced segments and for unvoiced segments. For voiced
segments, the source speaker vocal tract spectrum filter $V_s(\omega)$ is replaced with the
spectrum derived from the original linear predictive coefficient vector $a_k$:
$$V_s(\omega) = \frac{1}{1 - \sum_{k=1}^{P} a_k\, e^{-jk\omega}} \tag{19}$$



On the other hand, the linear predictive vector approximation coefficients, derived from
the codebook weighted line spectral frequency vector approximation $vS^s$, are used to
determine the source speaker vocal tract spectrum filter $V_s(\omega)$ for unvoiced segments.
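
The selection between the two estimates of the source vocal tract spectrum can be sketched as follows, assuming the standard all-pole spectrum $1/(1 - \sum_k a_k e^{-jk\omega})$ sampled at evenly spaced frequencies. The conversion from line spectral frequencies to predictor coefficients is not shown; `a_codebook` is assumed to hold the already converted approximation coefficients. All names and values are hypothetical.

```python
import cmath


def lpc_spectrum(a, n_bins):
    """All-pole spectrum V[m] = 1 / (1 - sum_k a[k-1] * exp(-j*k*w_m)),
    sampled at w_m = 2*pi*m/n_bins."""
    spectrum = []
    for m in range(n_bins):
        w = 2.0 * cmath.pi * m / n_bins
        denom = 1.0 - sum(
            a_k * cmath.exp(-1j * k * w) for k, a_k in enumerate(a, start=1)
        )
        spectrum.append(1.0 / denom)
    return spectrum


def source_vocal_tract(a_original, a_codebook, voiced, n_bins):
    """Voiced segments use the original predictor coefficients; unvoiced
    segments use the codebook-weighted approximation coefficients."""
    return lpc_spectrum(a_original if voiced else a_codebook, n_bins)


v_voiced = source_vocal_tract([0.5], [0.25], voiced=True, n_bins=4)
v_unvoiced = source_vocal_tract([0.5], [0.25], voiced=False, n_bins=4)
print(v_voiced[0], v_unvoiced[0])
```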

In step 312, the result of applying $Y(\omega)$ for the current segment is post processed
into a time-domain target signal in the voice of the target speaker. More specifically, an
inverse discrete Fourier transform is applied to produce the synthetic target voice:

$$y(n) = \mathrm{Re}\{\mathrm{IDFT}\{Y(\omega)\}\}. \tag{20}$$
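
A minimal sketch of the post processing of equation (20) is given below: a direct inverse discrete Fourier transform followed by taking the real part. The function name and the use of an O(N^2) transform rather than a fast transform are choices made for illustration only.

```python
import cmath


def idft_real(spectrum):
    """Return Re{IDFT{Y}}: y(n) = Re( (1/N) * sum_k Y[k] * exp(+j*2*pi*k*n/N) )."""
    n_pts = len(spectrum)
    samples = []
    for n in range(n_pts):
        acc = sum(
            spectrum[k] * cmath.exp(2j * cmath.pi * k * n / n_pts)
            for k in range(n_pts)
        )
        samples.append((acc / n_pts).real)
    return samples


# A flat spectrum corresponds to a unit impulse in the time domain.
y = idft_real([1, 1, 1, 1])
print(y)
```

Taking the real part discards any small imaginary component left when the modified spectrum is not exactly conjugate-symmetric.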

Prosody Transformation

According to one embodiment of the present invention, prosodic transformations
may be applied to the frequency domain target voice signal $Y(\omega)$ before post processing
into the time domain. Prosodic transformations allow the target voice to match the
source voice in pitch, duration, and stress. For example, a pitch scale modification factor
$\beta$ at each frame can be set as

$$\beta = \frac{\dfrac{\sigma_t^2}{\sigma_s^2}\,(f_0 - \mu_s) + \mu_t}{f_0} \tag{21}$$

where $\sigma_s^2$ is the source pitch variance, $\sigma_t^2$ is the target pitch variance, $f_0$ is the source
speaker fundamental frequency, $\mu_s$ is the source mean pitch value, and $\mu_t$ is the target
mean pitch value. For duration characteristics, a time-scale modification factor $\gamma$ can be
set according to the same codebook weights:

$$\gamma = \sum_i v_i \frac{d_i^t}{d_i^s} \tag{22}$$

where $d_i^s$ is the average source speaker duration and $d_i^t$ is the average target speaker
duration. For the speakers' stress characteristics, an energy-scale modification factor $\eta$
can be set according to the same codebook weights:

$$\eta = \sum_i v_i \frac{e_i^t}{e_i^s} \tag{23}$$

where $e_i^s$ is the average source speaker RMS energy and $e_i^t$ is the average target speaker
RMS energy.
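
The three prosodic factors may be illustrated by the sketch below. The duration and energy factors are assumed to be codebook-weighted sums of per-entry target-to-source ratios. The pitch function assumes a mean-and-variance matching form for equation (21), namely the source pitch deviation from its mean scaled by the variance ratio, shifted to the target mean, and divided by the source pitch. That form, the function names, and the sample values are assumptions and should be checked against the original equations.

```python
def pitch_scale(f0, mu_s, mu_t, var_s, var_t):
    """Assumed form: beta = ((var_t / var_s) * (f0 - mu_s) + mu_t) / f0."""
    return ((var_t / var_s) * (f0 - mu_s) + mu_t) / f0


def weighted_ratio(weights, source_values, target_values):
    """sum_i v_i * target_i / source_i, used here for both duration and energy."""
    return sum(
        v * t / s for v, s, t in zip(weights, source_values, target_values)
    )


weights = [0.5, 0.5]

# Pitch: a source frame at the source mean maps to the target mean.
beta = pitch_scale(f0=100.0, mu_s=100.0, mu_t=200.0, var_s=400.0, var_t=400.0)

# Duration (seconds) and RMS energy per codebook entry, made-up values.
gamma = weighted_ratio(weights, [0.25, 0.5], [0.5, 0.5])
eta = weighted_ratio(weights, [2.0, 4.0], [1.0, 4.0])
print(beta, gamma, eta)
```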

The pitch-scale modification factor $\beta$, the time-scale modification factor $\gamma$, and the
energy scaling factor $\eta$ are applied by an appropriate methodology, such as within a
pitch-synchronous overlap-add synthesis framework, to perform the prosodic synthesis.
One overlap-add synthesis methodology is explained in more detail in the commonly
assigned Application Ser. No. entitled "Prosody Modification System and
Methodology," filed concurrently by Francisco M. Gimenez de los Galenes, the contents
of which are herein incorporated by reference.

While this invention has been described in connection with what is presently
considered to be the most practical and preferred embodiment, it is to be understood that
the invention is not limited to the disclosed embodiment, but on the contrary, is intended
to cover various modifications and equivalent arrangements included within the spirit and
scope of the appended claims.

CLAIMS



WHAT IS CLAIMED IS: 



1. A method of transforming a source signal representing a source voice into a target
signal representing a target voice, said method comprising the machine-implemented
steps of:

preprocessing said source signal to produce a source signal segment;
comparing the source signal segment with a plurality of source codebook entries
representing speech units in said source voice to produce therefrom a plurality of
corresponding weights;
transforming the source signal segment into a target signal segment based on the
plurality of weights and a plurality of target codebook entries representing speech
units in said target voice, said target codebook entries corresponding to the
plurality of source codebook entries; and
post processing the target signal segment to generate said target signal.

2. A method as in claim 1, wherein the step of preprocessing said source signal
includes the step of sampling said source signal to produce a sampled source signal.

3. A method as in claim 2, wherein the step of preprocessing said source signal
includes the step of segmenting said sampled source signal to produce the source signal
segment.



4. A method as in claim 1, wherein the step of comparing the source signal segment
to produce therefrom a plurality of corresponding weights includes the step of comparing
the source signal segment to produce therefrom a plurality of corresponding perceptual
weights.

5. A method as in claim 1, wherein the step of comparing the source signal segment
includes the steps of:

converting the source signal segment into a plurality of line spectral frequencies; and
comparing the plurality of line spectral frequencies with the plurality of the source 
code entries to produce therefrom the plurality of the respective weights, wherein 
each of the source code entries include a respective plurality of line spectral 
frequencies. 

6. A method as in claim 5, wherein the step of converting the source signal segment 
includes the steps of: 

determining a plurality of coefficients for the source signal segment; and 
converting the plurality of coefficients into the plurality of line spectral frequencies. 

7. A method as in claim 6, wherein the step of determining a plurality of coefficients 
includes the step of determining a plurality of linear prediction coefficients or PARCOR 
coefficients. 

8. A method as in claim 5, wherein the step of comparing the plurality of line 
spectral frequencies includes the steps of: 

computing a plurality of distances between the source signal segment, represented by 
the plurality of line spectral frequencies, and each of the plurality of the respective 
source code entries, represented by a respective plurality of line spectral 
frequencies; and 

producing the plurality of the weights based on the plurality of respective distances. 

9. A method as in claim 8, further including the step of refining the plurality of
weights by a gradient descent method.

10. A method as in claim 1, wherein the step of transforming the source signal
segment into a target signal segment based on the plurality of weights and a plurality of
target codebook entries includes the step of transforming vocal tract characteristics of the
source signal segment into the target signal segment based on the plurality of weights and
a plurality of target codebook entries.

11. A method as in claim 10, wherein the step of transforming vocal tract
characteristics includes the step of reducing formant bandwidths in the target signal
segment.

12. A method as in claim 10, wherein the step of transforming the source signal
segment into a target signal segment based on the plurality of weights and a plurality of
target codebook entries includes the step of transforming excitation characteristics of the
source signal segment into the target signal segment based on the plurality of weights.

13. A method as in claim 1, further including the step of modifying the prosody of
the target signal segment based on the plurality of weights.

14. A method as in claim 13, wherein the step of modifying the prosody of the target
signal segment based on the plurality of weights includes the step of modifying the
duration of the target signal segment.

15. A method as in claim 13, wherein the step of modifying the prosody of the target
signal segment based on the plurality of weights includes the step of modifying the stress
of the target signal segment.

16. A computer-readable medium bearing instructions for transforming a source
signal representing a source voice into a target signal representing a target voice, said
instructions arranged, when executed, to cause one or more processors to perform the
steps of:

preprocessing said source signal to produce a source signal segment;
comparing the source signal segment with a plurality of source codebook entries
representing speech units in said source voice to produce therefrom a plurality of
corresponding weights;
transforming the source signal segment into a target signal segment based on the
plurality of weights and a plurality of target codebook entries representing speech
units in said target voice, said target codebook entries corresponding to the
plurality of source codebook entries; and
post processing the target signal segment to generate said target signal.

17. A computer-readable medium as in claim 16, wherein the step of preprocessing
said source signal includes the step of sampling said source signal to produce a sampled
source signal.

18. A computer-readable medium as in claim 17, wherein the step of preprocessing
said source signal includes the step of segmenting said sampled source signal to produce
the source signal segment.

19. A method as in claim 16, wherein the step of comparing the source signal
segment to produce therefrom a plurality of corresponding weights includes the step of
comparing the source signal segment to produce therefrom a plurality of corresponding
perceptual weights.

20. A computer-readable medium as in claim 16, wherein the step of comparing the
source signal segment includes the steps of:

converting the source signal segment into a plurality of line spectral frequencies; and
comparing the plurality of line spectral frequencies with the plurality of the source
code entries to produce therefrom the plurality of the respective weights, wherein
each of the source code entries include a respective plurality of line spectral
frequencies.

21. A computer-readable medium as in claim 20, wherein the step of converting the
source signal segment includes the steps of:

determining a plurality of coefficients for the source signal segment; and
converting the plurality of coefficients into the plurality of line spectral frequencies.

22. A computer-readable medium as in claim 21, wherein the step of determining a
plurality of coefficients includes the step of determining a plurality of linear prediction
coefficients or PARCOR coefficients.

23. A computer-readable medium as in claim 20, wherein the step of comparing the
plurality of line spectral frequencies includes the steps of:

computing a plurality of distances between the source signal segment, represented by
the plurality of line spectral frequencies, and each of the plurality of the respective
source code entries, represented by a respective plurality of line spectral
frequencies; and
producing the plurality of the weights based on the plurality of respective distances.

24. A computer-readable medium as in claim 23, further including the step of
refining the plurality of the weights by a gradient descent method.

25. A computer-readable medium as in claim 16, wherein the step of transforming
the source signal segment into a target signal segment based on the plurality of weights
and a plurality of target codebook entries includes the step of transforming vocal tract
characteristics of the source signal segment into the target signal segment based on the
plurality of weights and a plurality of target codebook entries.

26. A computer-readable medium as in claim 25, wherein the step of transforming
vocal tract characteristics includes the step of reducing formant bandwidths in the target
signal segment.

27. A computer-readable medium as in claim 25, wherein the step of transforming
the source signal segment into a target signal segment based on the plurality of weights
and a plurality of target codebook entries includes the step of transforming excitation
characteristics of the source signal segment into the target signal segment based on the
plurality of weights.

28. A computer-readable medium as in claim 16, wherein the instructions, when
executed, are further arranged to perform the step of modifying the prosody of the target
signal segment based on the plurality of weights.

29. A computer-readable medium as in claim 28, wherein the step of modifying the
prosody of the target signal segment based on the plurality of weights includes the step of
modifying the duration of the target signal segment.

30. A computer-readable medium as in claim 28, wherein the step of modifying the
prosody of the target signal segment based on the plurality of weights includes the step of
modifying the stress of the target signal segment.

Figure 3